Developing machine learning algorithms for dynamic estimation of progression during active surveillance for prostate cancer

Active Surveillance (AS) for prostate cancer is a management option that continually monitors early disease and considers intervention if progression occurs. A robust method to incorporate “live” updates of progression risk during follow-up has hitherto been lacking. To address this, we developed a deep learning-based individualised longitudinal survival model using Dynamic-DeepHit-Lite (DDHL) that learns a data-driven distribution of time-to-event outcomes. To further refine the outputs, we used a reinforcement learning approach (Actor-Critic) for temporal predictive clustering (AC-TPC) to discover groups with similar time-to-event outcomes to support clinical utility. We applied these methods to data from 585 men on AS with longitudinal and comprehensive follow-up (median 4.4 years). Time-dependent C-indices and Brier scores were calculated and compared to Cox regression and landmarking methods. Cox and DDHL models including only baseline variables showed comparable C-indices, but DDHL model performance improved with additional follow-up data. With 3 years of data collection and 3 years of follow-up, the DDHL model had a C-index of 0.79 (±0.11) compared to 0.70 (±0.15) for landmarking Cox and 0.67 (±0.09) for baseline Cox only. Model calibration was good across all models tested. The AC-TPC method further discovered 4 distinct outcome-related temporal clusters with distinct progression trajectories. Those in the lowest risk cluster had negligible progression risk while those in the highest cluster had a 50% risk of progression by 5 years. In summary, we report a novel machine learning approach to inform personalised follow-up during active surveillance which improves predictive power with increasing data input over time.

Note that irregular time intervals between observations can generally be described by the actual timestamps $t_j$.
Let $T \in \mathbb{R}_{\ge 0}$ be a random variable for the time-to-event and $C \in \mathbb{R}_{\ge 0}$ be a random variable for the time-to-censoring. We assume that $T$ and $C$ are each drawn from a conditional distribution that depends on the history of a patient's longitudinal observations, and we only observe whichever of the event or the censoring occurs first, i.e., $\delta = \mathbb{1}\{T \le C\}$ and $\tau = \min(T, C)$. Then, the conditional hazard function $h(s \mid \mathcal{X}(t_j))$ [6], which represents the instantaneous risk of the outcome event occurring given the history $\mathcal{X}(t_j)$, can be defined as:

$$h(s \mid \mathcal{X}(t_j)) = \lim_{\Delta s \to 0} \frac{P\big(s \le T - t_j < s + \Delta s \mid T - t_j \ge s, \mathcal{X}(t_j)\big)}{\Delta s} \tag{1}$$

Now, we assume that the conditional hazard functions follow the Weibull distribution [2], which is one of the most common parametric forms used to analyze time-to-event processes. That is, given the history $\mathcal{X}(t_j)$, (1) can be simplified as:

$$h(s \mid \mathcal{X}(t_j)) = \lambda(\mathcal{X}(t_j)) \, \kappa \, s^{\kappa - 1} \tag{2}$$

where $\lambda(\mathcal{X}(t_j)) > 0$ is the conditional intensity function given $\mathcal{X}(t_j)$ and $\kappa > 0$ is the shape parameter (see footnote 1). Then, given a clinical pathway $\mathcal{X}(t_j)$, we can derive the risk of having an event occur at or before time $s$ elapsed since the last observation time $t_j$ as

$$F(s \mid \mathcal{X}(t_j)) = 1 - \exp\big(-\lambda(\mathcal{X}(t_j)) \, s^{\kappa}\big) \tag{3}$$

Footnote 1: The Weibull distribution is a generalization of the exponential distribution. For instance, when $\kappa = 1$, it reduces to the standard exponential distribution and has a constant hazard function over time, while the hazard function is increasing over time when $\kappa > 1$ and decreasing when $\kappa < 1$.

The risk, $F(s \mid \mathcal{X}(t_j))$, denotes the probability of an event occurring at or before time $s$ given the input pathway up to time step $j$. It is worth highlighting that whenever a new observation is collected, Dynamic-DeepHit-Lite re-issues risk predictions that start from 0, because the patient is known to be event-free at the time the new observation is collected.

Utilizing the Gated Recurrent Unit (GRU) [7], the latent encoding $z_j$ is derived from the GRU update equations, where $W$, $U$, and $b$ are the weight matrices and vectors that parameterize the encoder, $\circ$ is element-wise multiplication, $\sigma(\cdot)$ is the sigmoid function, and $\tanh(\cdot)$ is the hyperbolic tangent function.
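To make the Weibull assumption concrete, the following minimal sketch evaluates a hazard and a cumulative risk for a fixed intensity and shape. It assumes the common intensity/shape parameterization, hazard = intensity × shape × s^(shape − 1) and risk = 1 − exp(−intensity × s^shape); the function names and the numerical values are hypothetical and chosen only for illustration. In the model itself the intensity would be produced by the network from the patient history rather than fixed by hand.

```python
import math


def weibull_hazard(s: float, intensity: float, shape: float) -> float:
    """Weibull hazard at elapsed time s > 0 (assumed parameterization)."""
    if s <= 0 or intensity <= 0 or shape <= 0:
        raise ValueError("s, intensity and shape must all be positive")
    return intensity * shape * s ** (shape - 1.0)


def weibull_risk(s: float, intensity: float, shape: float) -> float:
    """Probability that the event occurs at or before elapsed time s >= 0."""
    if s < 0 or intensity <= 0 or shape <= 0:
        raise ValueError("s must be non-negative; intensity and shape positive")
    # The cumulative hazard is intensity * s**shape, so risk = 1 - exp(-cumulative hazard).
    return 1.0 - math.exp(-intensity * s ** shape)


if __name__ == "__main__":
    # Hypothetical values, not estimates from the study.
    intensity, shape = 0.05, 1.5
    for years in (0.0, 1.0, 3.0, 5.0):
        print(years, round(weibull_risk(years, intensity, shape), 4))
```

Because the risk is a function of time elapsed since the latest observation, it is zero at s = 0, which mirrors the way predictions restart from 0 whenever a new observation arrives.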
The predictor, $g_\phi : \mathcal{Z} \to \mathbb{R}_{>0}$, is a fully-connected network (parameterized by $\phi$) that estimates the conditional intensity functions in (2).

The aim is to group patients' clinical pathways into temporal clusters that share similar time-to-event predictions. More specifically, we formalize temporal clustering, as defined in [5], as learning discrete representations that best characterize the underlying time-to-event process learned by Dynamic-DeepHit-Lite through the pathways. The key insight here is that learning the embeddings (i.e., a finite number of latent representations available for discrete representation learning) and the mappings from pathways to these embeddings can be viewed as learning the centroids of each cluster (i.e., the representative representation of each cluster) and the assignments of the pathways to these clusters, respectively.
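The correspondence described above, embeddings as cluster centroids and mappings as cluster assignments, can be sketched as a simple dictionary lookup. The snippet below is a minimal illustration rather than the authors' implementation: the per-cluster probabilities (`selector_probs`), the function name, and the choice of the most probable cluster are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple


def assign_clusters(
    selector_probs: Sequence[Sequence[float]],
    embeddings: Sequence[Sequence[float]],
) -> Tuple[List[int], List[Sequence[float]]]:
    """Map each time step of one pathway to a cluster and its embedding.

    selector_probs[j][k] is a hypothetical probability that the pathway
    belongs to cluster k + 1 at time step j; embeddings[k] is the centroid
    (embedding) of cluster k + 1.
    """
    assignments: List[int] = []
    discrete_repr: List[Sequence[float]] = []
    for probs in selector_probs:
        if len(probs) != len(embeddings):
            raise ValueError("one probability is needed per cluster")
        # Assumption: take the most probable cluster (1-based index).
        cluster = max(range(len(probs)), key=lambda k: probs[k]) + 1
        assignments.append(cluster)
        # The discrete representation is the centroid of the chosen cluster.
        discrete_repr.append(embeddings[cluster - 1])
    return assignments, discrete_repr


if __name__ == "__main__":
    # Hypothetical 2-dimensional centroids for 3 clusters.
    centroids = [[0.0, 0.1], [0.5, 0.4], [0.9, 1.2]]
    probs_over_time = [[0.7, 0.2, 0.1], [0.3, 0.6, 0.1], [0.1, 0.2, 0.7]]
    print(assign_clusters(probs_over_time, centroids))
```

A pathway can change cluster from one time step to the next, which is what makes the clustering temporal.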
Let $s_j \in \{1, \dots, K\}$ be the cluster assignment at time step $j$ and let $\mathcal{E} = \{e(1), \dots, e(K)\}$, where $e(k) \in \mathcal{Z}$, be the embedding dictionary. Then, we define $\bar{z}_j \triangleq e(s_j) \in \mathcal{Z}$ to be the embedding, a discrete representation of the pathway at time step $j$.

Given the conditional intensity functions implied by the cluster assignment, $\bar{\lambda}$, and those estimated by the trained Dynamic-DeepHit-Lite, $\lambda$, we can compute the JS-divergence between the two time-to-event processes, $JS(\lambda \,\|\, \bar{\lambda})$ (see footnote 2). Finally, we replace the loss functions in [5] with the newly defined divergence (5).

Section F: Partial dependence plot to determine the order of contributing variables on cluster movement

A partial dependence plot was used to examine how the assigned temporal cluster changes when the value of one variable is varied while the values of the other variables are held fixed [8]. Since the three variables (PSA, MRI Stage, and Grade) are on different scales and have different categories, we plotted the average effect on cluster status in Figures F1-F3. In these figures, a positive transition frequency indicates the frequency of transitions to a higher-risk cluster (e.g., from Cluster 2 to Cluster 3), and a negative transition frequency indicates the frequency of transitions to a lower-risk cluster (e.g., from Cluster 2 to Cluster 1).
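The partial dependence procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the analysis code used for the figures: `assign_cluster` stands in for the trained clustering model, `toy_assign_cluster` is a made-up rule, and the signed frequency is computed here as the share of upward transitions minus the share of downward transitions, which is only one plausible reading of "transition frequency".

```python
from typing import Callable, Dict, Sequence

Patient = Dict[str, float]


def transition_frequency(
    patients: Sequence[Patient],
    assign_cluster: Callable[[Patient], int],
    variable: str,
    new_value: float,
) -> float:
    """Signed frequency of cluster changes when one variable is overwritten.

    Positive values mean moves to higher-risk clusters dominate; negative
    values mean moves to lower-risk clusters dominate.
    """
    if not patients:
        raise ValueError("at least one patient is required")
    net = 0
    for patient in patients:
        before = assign_cluster(patient)
        altered = dict(patient)          # keep all other variables fixed
        altered[variable] = new_value    # change only the variable of interest
        after = assign_cluster(altered)
        net += (after > before) - (after < before)
    return net / len(patients)


def toy_assign_cluster(patient: Patient) -> int:
    """Hypothetical stand-in for the trained model; returns a cluster in 1..4."""
    score = patient["psa"] / 10.0 + patient["grade"]
    return 1 + min(3, int(score))


if __name__ == "__main__":
    cohort = [{"psa": 4.0, "grade": 1.0}, {"psa": 12.0, "grade": 1.0}]
    for grade in (0.0, 1.0, 2.0):
        print(grade, transition_frequency(cohort, toy_assign_cluster, "grade", grade))
```

Repeating the call over a grid of values for each variable gives the curve plotted for that variable.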
As can be seen in the figures, the variables contributing most to temporal cluster status are, in order, Grade, PSA, and Stage.

Footnote 2: We use the JS-divergence between the two Weibull distributions, instead of the KL-divergence used in the original AC-TPC [5], because it is symmetric.

We compared Dynamic-DeepHit-Lite with the Cox proportional hazards model [8], [9] at baseline (using static covariates only) and with landmarking Cox [10], [11] (using both static and temporal covariates up to the prediction times) in the dynamic setting. The full set of features is used, as with Dynamic-DeepHit-Lite, and the regularization parameter is set to 1e-3. For the landmarking Cox, we set the landmarking times to 0, 1, 2, and 3 years.
For evaluating discriminative performance, we use the time-dependent concordance index for right-censored data based on inverse probability of censoring weights [13] throughout; for evaluating calibration performance, we compute the time-dependent Brier score [14].
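As a rough illustration of what a time-dependent concordance index measures, the sketch below counts, among comparable pairs, how often the patient who progressed earlier was given the higher risk. It is deliberately simplified: it omits the inverse probability of censoring weights of [13], so it is not the estimator used in this study, and the function name and arguments are hypothetical.

```python
from typing import Sequence


def time_dependent_c_index(
    times: Sequence[float],
    events: Sequence[int],
    risks: Sequence[float],
    horizon: float,
) -> float:
    """Unweighted concordance for events observed at or before `horizon`.

    A pair (i, j) is comparable when patient i had an observed event at or
    before the horizon and patient j was still under follow-up after that
    event time. Tied risks count as half-concordant.
    """
    if not (len(times) == len(events) == len(risks)):
        raise ValueError("times, events and risks must have equal length")
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not events[i] or times[i] > horizon:
            continue  # only observed events within the horizon anchor a pair
        for j in range(len(times)):
            if times[j] > times[i]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs at this horizon")
    return concordant / comparable


if __name__ == "__main__":
    # Hypothetical follow-up times (years), event indicators and risks.
    print(time_dependent_c_index([1, 2, 3, 4], [1, 1, 0, 0], [0.9, 0.6, 0.3, 0.1], 3))
```

Any ordinal score can be passed as `risks`, provided that a higher value means a higher predicted risk.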
When evaluating discriminative performance to compare the discovered clusters with simple stratification approaches, we use the time-dependent concordance index as above, with the cluster index or stratum index (a higher index corresponding to a higher-risk group) used as the risk estimate. The time-to-event models, including DDHL, LM-Cox, and static Cox, make risk predictions whose values lie between 0 and 1. Thus, we use those predictions directly to evaluate discrimination and prediction performance. In contrast, the clustering methods, including AC-TPC and Canary-PASS, predict to which cluster a patient belongs based on his longitudinal observations. We compared the "discriminative power" of the two clustering methods using the average predicted risk for each cluster, as an indirect way to assess how similar patients are within a cluster and how dissimilar they are across clusters. Given these differences, we sought to make the comparison between the two clustering methods as fair as possible. More specifically, when building the Canary-PASS model, we first trained an LM-Cox model on the same training set; in particular, time-to-event information was also provided to the Canary-PASS model during training. To provide a fair comparison of the discriminative power of the discovered groups (i.e., clusters in the proposed method and strata in the Canary-PASS model), we matched the number of groups. Hence, we used the results of Canary-PASS as a 4-strata model, matching the number of clusters discovered by our method.
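The step of summarizing each group by its average predicted risk can be sketched as below. The function and variable names are hypothetical, and the toy inputs are not study data; the sketch only shows how a per-patient group label and a per-patient predicted risk might be combined into one risk value per group.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def cluster_average_risk(
    group_ids: Sequence[Hashable],
    predicted_risks: Sequence[float],
) -> Tuple[List[float], Dict[Hashable, float]]:
    """Replace each patient's risk by the mean risk of his cluster or stratum."""
    if len(group_ids) != len(predicted_risks):
        raise ValueError("one predicted risk is needed per patient")
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for group, risk in zip(group_ids, predicted_risks):
        totals[group] += risk
        counts[group] += 1
    means = {group: totals[group] / counts[group] for group in totals}
    # Every patient in the same group receives the same group-level risk.
    return [means[group] for group in group_ids], means


if __name__ == "__main__":
    per_patient, per_group = cluster_average_risk([1, 1, 2, 4], [0.1, 0.3, 0.5, 0.8])
    print(per_patient, per_group)
```

The per-patient output can then be scored with the same discrimination metric for both grouping methods, so that the two are compared on equal terms.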